The use of trigram analysis for spelling error detection

نویسندگان

  • E. M. Zamora
  • Joseph J. Pollock
  • Antonio Zamora
چکیده

Work performed under the SPElling Error Detection Correction Project (SPEEDCOP) supported by National Science Foundation (NSF) at Chemical Abstracts Service (CAS) to devise effective automatic methods of detecting and correcting misspellings in scholarly and scientific text is described. The investigation was applied to 50,000 wordlmisspelling pairs collected from six datasets (Chemical Industry Notes (GIN), Biological Abstracts (BA). Chemical Abstracts (CA), America1 Chem~cai Society primary journal keyboarding (ACS), Information Science Abstracts (ISA), and Distributed On-Line Editing (DOLE) (a CAS internal dataset especially suited to spelling error studies). The purpose of this study was to determine the utility of trigram analysis in the automatic detection and/or correction of misspellings. Computer programs were developed to collect data on trigram distribution in each dataset and to explore the potential of trigram analysis for detecting spelling errors, verifying correctly-spelled words, locating the error site within a misspelling, and distinguishing between the basic kinds of spelling errors. The results of the trigram analysis were largely independent of the dataset to which it was applied but trigram compositions varied with the dataset. The trigram analysis technique developed determined the error site within a misspelling accurately. but did not distinguish effectively between different error types or between valid words and misspellings. However. methods for increasing its accuracy are suggested.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Detection is the central problem in real-word spelling correction

Real-word spelling correction differs from non-word spelling correction in its aims and its challenges. Here we show that the central problem in real-word spelling correction is detection. Methods from non-word spelling correction, which focus instead on selection among candidate corrections, do not address detection adequately, because detection is either assumed in advance or heavily constrai...

متن کامل

Real-Word Spelling Correction with Trigrams: A Reconsideration of the Mays, Damerau, and Mercer Model

The trigram-based noisy-channel model of real-word spelling-error correction that was presented by Mays, Damerau, and Mercer in 1991 has never been adequately evaluated or compared with other methods. We analyze the advantages and limitations of the method, and present a new evaluation that enables a meaningful comparison with the WordNet-based method of Hirst and Budanitsky. The trigram method...

متن کامل

Design and implementation of Persian spelling detection and correction system based on Semantic

Persian Language has a special feature (grapheme, homophone, and multi-shape clinging characters) in electronic devices. Furthermore, design and implementation of NLP tools for Persian are more challenging than other languages (e.g. English or German). Spelling tools are used widely for editing user texts like emails and text in editors.  Also developing Persian tools will provide Persian progr...

متن کامل

Cooperative Error Handling and Shallow Processing

This paper is concerned with the detection and correction of sub-sentential English text errors. Previous spelling programs, unless restricted to a very small set of words, have operated as post-processors. And to date, grammar checkers and other programs which deal with ill-formed input usually step directly from spelling considerations to a full-scale parse, assuming a complete sentence. Work...

متن کامل

A simple real-word error detection and correction using local word bigram and trigram

Spelling error is broadly classified in two categories namely non word error and real word error. In this paper a localized real word error detection and correction method is proposed where the scores of bigrams generated by immediate left and right neighbour of the candidate word and the trigram of these three words are combined. A single character position error model is assumed so that if a ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:
  • Inf. Process. Manage.

دوره 17  شماره 

صفحات  -

تاریخ انتشار 1981